
    Adapting Sequence to Sequence models for Text Normalization in Social Media

    Full text link
    Social media offer an abundant source of valuable raw data; however, informal writing can quickly become a bottleneck for many natural language processing (NLP) tasks. Off-the-shelf tools are usually trained on formal text and cannot explicitly handle the noise found in short online posts. Moreover, the variety of frequently occurring linguistic variations presents several challenges, even for humans, who might not be able to comprehend the meaning of such posts, especially when they contain slang and abbreviations. Text normalization aims to transform online user-generated text to a canonical form. Current text normalization systems rely on string or phonetic similarity and on classification models that operate in a local fashion. We argue that processing contextual information is crucial for this task and introduce a hybrid word-character attention-based encoder-decoder model for social media text normalization that can serve as a pre-processing step for NLP applications to adapt to noisy text in social media. Our character-based component is trained on synthetic adversarial examples that are designed to capture errors commonly found in online user-generated text. Experiments show that our model surpasses neural architectures designed for text normalization and achieves comparable performance with state-of-the-art related work.
    Comment: Accepted at the 13th International AAAI Conference on Web and Social Media (ICWSM 2019).
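
    The paper's code is not included in this listing; as a rough illustration of the kind of synthetic adversarial noise such a character-based component could be trained on, the sketch below builds (noisy, clean) training pairs with a few common social-media-style corruptions. The specific corruption operations and rates are assumptions for illustration, not the authors' exact procedure.

        import random

        # Hypothetical noise operations mimicking errors common in online text;
        # the exact operations and probabilities are illustrative assumptions.
        def drop_vowels(word):
            # abbreviation-style noise, e.g. "tomorrow" -> "tmrrw"
            stripped = "".join(c for c in word if c.lower() not in "aeiou")
            return stripped or word

        def repeat_last_char(word):
            # emphasis-style noise, e.g. "so" -> "sooo"
            return word + word[-1] * random.randint(1, 3)

        def swap_adjacent(word):
            # keyboard-typo-style noise: swap two neighboring characters
            if len(word) < 2:
                return word
            i = random.randrange(len(word) - 1)
            return word[:i] + word[i + 1] + word[i] + word[i + 2:]

        CORRUPTIONS = [drop_vowels, repeat_last_char, swap_adjacent]

        def make_pair(clean_sentence, p=0.3):
            """Return a (noisy, clean) pair by corrupting random words."""
            noisy = [random.choice(CORRUPTIONS)(w) if random.random() < p else w
                     for w in clean_sentence.split()]
            return " ".join(noisy), clean_sentence

        if __name__ == "__main__":
            random.seed(0)
            print(make_pair("see you tomorrow at the game"))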

    FixEval: Execution-based Evaluation of Program Fixes for Programming Problems

    Full text link
    The increasing complexity of software has led to a drastic rise in the time and cost of identifying and fixing bugs. Various approaches are explored in the literature to automatically generate fixes for buggy code. However, few tools and datasets are available to evaluate model-generated fixes effectively due to the large combinatorial space of possible fixes for a particular bug. In this work, we introduce FIXEVAL, a benchmark comprising buggy code submissions to competitive programming problems and their respective fixes. FIXEVAL is composed of a rich test suite to evaluate and assess the correctness of model-generated program fixes, together with further information regarding time and memory constraints and verdict-based acceptance. We consider two Transformer language models pretrained on programming languages as our baselines and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not reflect model-generated program fixes accurately, whereas execution-based methods evaluate programs through all cases and scenarios designed explicitly for that solution. Therefore, we believe FIXEVAL provides a step towards real-world automatic bug fixing and model-generated code evaluation. The dataset and models are open-sourced at https://github.com/mahimanzum/FixEval.
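
    FIXEVAL's own evaluation harness lives in the linked repository; the snippet below is only a minimal sketch of the contrast between match-based and execution-based evaluation, running a candidate fix against input/output test cases under a time limit. The assumption that submissions are standalone Python programs, as well as the file and test-case format, are illustrative choices, and memory limits are not enforced here.

        import subprocess

        def exact_match(candidate_fix: str, reference_fix: str) -> bool:
            """Match-based metric: string equality with the reference fix."""
            return candidate_fix.strip() == reference_fix.strip()

        def passes_tests(source_path, test_cases, time_limit=2.0) -> bool:
            """Execution-based check: run the candidate program on each test
            case and compare its stdout with the expected output."""
            for stdin_text, expected in test_cases:
                try:
                    result = subprocess.run(
                        ["python", source_path],   # assumes a Python submission
                        input=stdin_text,
                        capture_output=True,
                        text=True,
                        timeout=time_limit,        # crude time-limit enforcement
                    )
                except subprocess.TimeoutExpired:
                    return False                   # time limit exceeded
                if result.returncode != 0:
                    return False                   # runtime error
                if result.stdout.strip() != expected.strip():
                    return False                   # wrong answer
            return True

        # A fix that differs textually from the reference (exact_match is False)
        # can still be accepted if it passes every test case:
        # tests = [("1 2\n", "3"), ("5 7\n", "12")]
        # print(passes_tests("candidate_fix.py", tests))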

    Rationalization for Explainable NLP: A Survey

    Get PDF
    Recent advances in deep learning have improved the performance of many Natural Language Processing (NLP) tasks such as translation, question-answering, and text classification. However, this improvement comes at the expense of model explainability. Black-box models make it difficult to understand the internals of a system and the process it takes to arrive at an output. Numerical (LIME, Shapley) and visualization (saliency heatmap) explainability techniques are helpful; however, they are insufficient because they require specialized knowledge. These factors led rationalization to emerge as a more accessible explainability technique in NLP. Rationalization justifies a model's output by providing a natural language explanation (rationale). Recent improvements in natural language generation have made rationalization an attractive technique because it is intuitive, human-comprehensible, and accessible to non-technical users. Since rationalization is a relatively new field, its literature is disorganized. This survey, the first of its kind, analyzes rationalization literature in NLP from 2007 to 2022. It presents the available methods, explainability evaluations, code, and datasets used across the various NLP tasks that employ rationalization. Further, a new subfield of Explainable AI (XAI), namely Rational AI (RAI), is introduced to advance the current state of rationalization. A discussion of observed insights, challenges, and future directions is provided to point to promising research opportunities.

    Data quality in the deep learning era: Active semi-supervised learning and text normalization for natural language understanding

    No full text
    Deep Learning, a growing sub-field of machine learning, has been applied with tremendous success in a variety of domains, opening opportunities for achieving human-level performance in many applications. However, Deep Learning methods depend on large quantities of data with millions of annotated instances. And while well-formed academic datasets have helped advance supervised learning research, in the real world we are daily deluged by massive amounts of unstructured data that remain unusable for current supervised learning approaches, as only a small portion is labeled, cleaned, or structured. For a machine learning model to be effective, volume is not the only data dimension that matters. Quality is equally important and has proven to be a critical factor for the success of industrial applications of machine learning. According to IBM, poor data quality can cost more than 3 trillion US dollars per year for the US market alone. Inspired by the need for advanced methods that can efficiently address such bottlenecks, we develop machine learning techniques that can be leveraged to improve data quality in both data-related dimensions: the input and the output space.
    Having a set of labeled examples that can capture the task characteristics is one of the most important prerequisites for successfully applying machine learning. As such, we first focus on minimizing the annotation effort for any arbitrary user-defined task by exploring active learning methods. We show that the best-performing active learning strategy depends on the task at hand, and we propose a combination of active learners that maximizes annotation performance early in the process. We demonstrate the viability of the approach on several relation extraction tasks.
    Next, we observe that even though our method can be used to speed up the collection of labeled training data, the rest will remain unlabeled and thus unexploited. Semi-supervised learning methods proposed in the literature can utilize additional unlabeled data; however, they are typically compared on computer vision datasets such as CIFAR10. Here, we perform a systematic exploration of several semi-supervised methods for three sequence labeling tasks and two classification tasks. Additionally, most methods make assumptions that are less suitable for realistic scenarios. For example, methods proposed in the recent literature treat all unlabeled examples equally. Yet, in many cases we would like to sort out examples that might be less useful or confusing, particularly in noisy settings where examples with low training loss or high confidence are more likely to be clean examples. In addition, most methods assume that the unlabeled data can be classified into the same classes as the labeled data. This does not take into consideration the very possible scenario of out-of-class instances. For example, our classifier may be distinguishing cats from dogs, but the unlabeled examples may contain additional classes, such as shells, butterflies, etc. To this end, we design methods to mitigate these issues, with a re-weighting mechanism that can be incorporated into any consistency-based regularizer.
    Both active and semi-supervised learning methods aim to reduce labeling effort by either automatically expanding the training set or selecting the most informative examples for human annotation. However, bootstrapping approaches often have negative effects on NLP tasks due to the addition of falsely labeled instances. We address the challenge of producing good-quality proxy labels by leveraging the continuously growing stream of human annotations. We introduce a calibration of semi-supervised active learning where the confidence of the classifier is weighted by an auxiliary neural model that removes incorrectly labeled instances and dynamically adjusts the number of proxy labels included in each iteration. Experimental results show that our strategy outperforms baselines that combine traditional active learning with self-training.
    We have explored various ways to improve the output space of examples, but the input representation is equally important. Particularly for social media (the most abundant source of raw data nowadays), informal writing can cause several bottlenecks. For example, most Information Extraction (IE) tools rely on accurate understanding of text and struggle with the noisy and informal nature of social media due to high out-of-vocabulary (OOV) word rates. In this work, we design a hybrid word-character attention-based encoder-decoder model for social media text normalization that can serve as a pre-processing step for any off-the-shelf NLP tool to adapt to noisy social media text. Our model surpasses baseline neural models designed for text normalization and achieves comparable performance with state-of-the-art related work. Although we evaluate on NLP tasks, all methods developed are fairly general and can be applied to other supervised machine learning tasks in need of techniques that create meaningful data representations and simultaneously reduce the burden and cost of human annotations.
    U of I Only: Author requested U of Illinois access only (OA after 2 yrs) in the Vireo ETD system.
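
    The dissertation text itself is not reproduced here; as a minimal sketch of how a per-example re-weighting mechanism might be attached to a consistency-based regularizer, the snippet below down-weights low-confidence (possibly noisy or out-of-class) unlabeled examples. The weak/strong augmentation setup, threshold value, and weighting scheme are illustrative assumptions, not the thesis' exact formulation.

        import torch
        import torch.nn.functional as F

        def weighted_consistency_loss(logits_weak, logits_strong, threshold=0.9):
            """Per-example re-weighted consistency term (illustrative sketch).

            Unlabeled examples whose weakly-augmented prediction is
            low-confidence contribute nothing; higher-confidence examples
            contribute in proportion to that confidence.
            """
            with torch.no_grad():
                probs = F.softmax(logits_weak, dim=-1)
                confidence, pseudo_labels = probs.max(dim=-1)
                # zero weight below the threshold, scaled confidence above it
                weights = torch.where(confidence >= threshold,
                                      confidence,
                                      torch.zeros_like(confidence))
            per_example = F.cross_entropy(logits_strong, pseudo_labels,
                                          reduction="none")
            return (weights * per_example).mean()

        # Usage with random tensors standing in for model outputs:
        # logits_w, logits_s = torch.randn(8, 5), torch.randn(8, 5)
        # loss = weighted_consistency_loss(logits_w, logits_s)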

    Drink Bleach or Do What Now? Covid-HeRA: A Study of Risk-Informed Health Decision Making in the Presence of COVID-19 Misinformation

    Full text link
    Given the widespread dissemination of inaccurate medical advice related to the 2019 coronavirus pandemic (COVID-19), such as fake remedies, treatments, and prevention suggestions, misinformation detection has emerged as an open problem of high importance and interest for the research community. Several works study health misinformation detection, yet little attention has been given to the perceived severity of misinformation posts. In this work, we frame health misinformation as a risk assessment task. More specifically, we study the severity of each misinformation story and how readers perceive this severity, i.e., how harmful a message believed by the audience can be and what types of signals can be used to recognize potentially malicious fake news and detect refuted claims. To address our research questions, we introduce a new benchmark dataset, accompanied by detailed data analysis. We evaluate several traditional and state-of-the-art models and show there is a significant gap in performance when applying traditional misinformation classification models to this task. We conclude with open challenges and future directions.
    Comment: Accepted to AAAI ICWSM'22 Datasets Track.